Background: Shotgun metagenomes are often assembled before genes are annotated, which biases the estimated functional capacity of a community towards its most abundant members. For an unbiased assessment of community function, short reads need to be mapped directly to a gene or protein database. The ability to detect genes in short read sequences depends on pre- and post-sequencing decisions. The objective of the current study was to determine how library size selection, read length and format, protein database, e-value threshold, and sequencing depth impact gene-centric analysis of human fecal microbiomes when using DIAMOND, an alignment tool that is up to 20,000 times faster than BLASTX.

Results: Using metagenomes simulated from a database of experimentally verified protein sequences, we find that read length, e-value threshold, and the choice of protein database dramatically impact detection of a known target, with best performance achieved with longer reads, stricter e-value thresholds, and a custom database. Using publicly available metagenomes, we evaluated library size selection, paired end read strategy, and sequencing depth. Longer read lengths were achievable by merging paired ends when the sequencing library was size-selected to enable overlaps. When paired ends could not be merged, a congruent strategy in which both ends are independently mapped was acceptable. Sequencing depths of 5 million merged reads minimized the error of abundance estimates of specific target genes, including an antimicrobial resistance gene.

Conclusions: Shotgun metagenomes of DNA extracted from human fecal samples and sequenced on the Illumina platform should be size-selected to enable merging of paired end reads, and should be sequenced in the PE150 format with a minimum sequencing depth of 5 million mergeable reads to enable detection of specific target genes. Because the merged reads are expected to be 180-250 bp in length, the appropriate e-value threshold for DIAMOND would then need to be stricter than the default. Accurate and interpretable results for specific hypotheses will be best obtained using small databases customized for the research question.
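
As an illustration of the post-alignment step implied above, the following minimal sketch filters tabular alignment output by e-value and counts reads per target gene. It assumes the standard 12-column BLAST tabular layout (as produced by DIAMOND's tabular output when no custom fields are requested). The function names `filter_hits` and `count_by_target` are hypothetical, and the cutoff of 1e-5 is an arbitrary placeholder, not a value recommended by this study.

```python
import csv
import io

# Assumed column order: the standard 12-column BLAST tabular layout.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5):
    """Return the best hit per read among hits with evalue <= max_evalue.

    The result maps read ID -> (target ID, e-value, bit score).
    max_evalue is a placeholder; choose it for the read length in use.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(COLUMNS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue  # hit fails the e-value threshold
        previous = best.get(hit["qseqid"])
        # Keep only the highest-scoring passing hit for each read.
        if previous is None or bitscore > previous[2]:
            best[hit["qseqid"]] = (hit["sseqid"], evalue, bitscore)
    return best


def count_by_target(best):
    """Count how many reads were assigned to each target gene."""
    counts = {}
    for target, _evalue, _bitscore in best.values():
        counts[target] = counts.get(target, 0) + 1
    return counts
```

With a real alignment file, `filter_hits(open(path))` followed by `count_by_target(...)` would give raw per-gene read counts; normalizing those counts by sequencing depth or gene length is a separate step not shown here.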